Data Exploration and Visualization

Data Analytics for Finance

Caspar David Peter

Rotterdam School of Management, Accounting Department

Data Exploration and Visualization

Today’s Journey

  • Why visualize? The power and peril
  • Describing variables: Building blocks
  • Design principles that matter
  • Tables: The unsung hero

Literature

  • Huntington-Klein (2022) (Chapters 3 & 4)

Data Exploration and Visualization

Recap - Where did we leave things?

Last time: The Research Pipeline

Raw & Clean Data

  • Data acquisition (WRDS, identifiers)
  • Merging datasets (CUSIP, GVKEY, PERMNO)
  • Understanding and creating tidy data
Today’s Menu
  • Data exploration & visualization
  • Descriptive statistics
  • Effective communication of results

Today we learn how to LOOK at what we’ve built

Data Exploration and Visualization

Learning Objectives

By the end of today, you will be able to

  • Explain why visualization matters for research AND practice
  • Identify misleading visualizations and graph abuse
  • Describe distributions and relationships effectively
  • Apply core design principles to financial data
  • Create publication-quality summary statistics tables

Why Visualize? The Power and Peril

Why Visualize? The Power and Peril

Visualization in Practice

Just Eat Takeaway Analyst Presentation

What do you notice?

Key observation

Almost every slide contains a visualization!

Key question

Why do they do this?

Visualization in Practice

Growth Story

Gross Transaction Value Over Time

What story does this tell?

  • “~2X growth since pre-pandemic”
  • Clear trend, immediate impact
  • Visual > table of numbers

Visualization in Practice

Geographic Breakdown

Market Leadership Across Regions

Multiple dimensions in one view

  • Geography
  • Market position
  • Revenue by segment

Visualization in Practice

Comparative Performance

EBITDA by region over time

What does this show?

  • Multiple entities (regions)
  • Time dimension
  • Positive vs. negative values
  • Clear color coding

Why Visualize? The Power and Peril

The Case FOR Visualization

What is visualization (vis)?

Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively (Munzner 2025).

Why visualize?

  • Communication: Tell data-driven stories
  • Verification: Confirm expected patterns
  • Exploration: Find unexpected patterns
  • Falsification: Assess validity of models

Why Visualize? The Power and Peril

But… Visualization Can Mislead

When keepin’ it vis goes wrong

OpenAI

Anthropic

Why Visualize? The Power and Peril

But… Visualization Can Mislead

When keepin’ it vis goes wrong

Pizza toppings in the UK

Female height distribution

Why Visualize? The Power and Peril

Insights from Beattie and Jones (1992)

The use and abuse of graphs in annual reports: theoretical framework and empirical study

  • Study of 240 large UK companies’ annual reports
  • Average: 5.9 graphs per report
  • Key finding: Companies with ‘good’ performance significantly more likely to use graphs
  • Selectivity: 65% graph at least one key financial variable
  • But which variables do they choose?

Why Visualize? The Power and Peril

Two Forms of Distortion

Beattie and Jones (1992) Framework

Selectivity

Choosing WHICH data to visualize

  • Graph the good news, table the bad news
  • Example: Graph revenue growth, table margin decline

Measurement distortion

HOW data is presented

  • Axis manipulation
  • Visual exaggeration
  • Found in 30% of graphs!
  • Average exaggeration: 10.7%
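
A minimal sketch of how such exaggeration can be quantified. It assumes a graph discrepancy index of the form GDI = (a / b - 1) × 100, where a is the percentage change depicted in the graph and b is the percentage change in the underlying data. The function name and the example numbers are hypothetical and are not taken from the study.

```python
def graph_discrepancy_index(first_height, last_height, first_value, last_value):
    """Percentage by which a graph over- or understates the change in the data.

    Assumed definition: GDI = (a / b - 1) * 100, where
    a = percentage change in the plotted heights and
    b = percentage change in the underlying data.
    """
    depicted_change = (last_height - first_height) / first_height
    actual_change = (last_value - first_value) / first_value
    if actual_change == 0:
        raise ValueError("GDI is undefined when the data do not change")
    return (depicted_change / actual_change - 1) * 100


# Hypothetical example: revenue grows from 100 to 120 (+20%),
# but the bars grow from 4.0 cm to 5.0 cm (+25%).
gdi = graph_discrepancy_index(4.0, 5.0, 100.0, 120.0)
print(f"GDI: {gdi:.1f}%")
```

Under this definition, a value of zero means the graph is faithful to the data, a positive value means the trend is exaggerated, and a negative value means it is understated.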

Why Visualize? The Power and Peril

Distortion Example Overview

Why Visualize? The Power and Peril

Anscombe’s Quartet - The Famous Example

Identical Statistics, Different Stories

Anscombe’s Quartet Data
Variable   N  Mean    SD    P0   P25   P50    P75   P100
x1        11   9.0  3.32  4.00  6.50  9.00  11.50  14.00
y1        11   7.5  2.03  4.26  6.31  7.58   8.57  10.84
x2        11   9.0  3.32  4.00  6.50  9.00  11.50  14.00
y2        11   7.5  2.03  3.10  6.70  8.14   8.95   9.26
x3        11   9.0  3.32  4.00  6.50  9.00  11.50  14.00
y3        11   7.5  2.03  5.39  6.25  7.11   7.98  12.74
x4        11   9.0  3.32  8.00  8.00  8.00   8.00  19.00
y4        11   7.5  2.03  5.25  6.17  7.04   8.19  12.50
Visuals Matter

Why Visualize? The Power and Peril

Anscombe’s Quartet - The Famous Example

Key Takeaways

  • Summaries lose information
  • Details matter
  • Visualization reveals structure that statistics hide
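
The point can be checked numerically. The sketch below hard-codes the standard published values of Anscombe's quartet and computes the mean, standard deviation, and Pearson correlation for each of the four pairs. The helper `pearson_r` and the dictionary layout are illustrative choices, not part of the lecture material.

```python
from math import sqrt
from statistics import mean, stdev

X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


summary = {}
for name, (x, y) in ANSCOMBE.items():
    summary[name] = {
        "mean_x": mean(x), "sd_x": stdev(x),
        "mean_y": mean(y), "sd_y": stdev(y),
        "corr": pearson_r(x, y),
    }
    s = summary[name]
    print(f"{name:>3}: mean_y={s['mean_y']:.2f} sd_y={s['sd_y']:.2f} corr={s['corr']:.3f}")
```

All four datasets print nearly the same numbers, which is exactly why only a scatter plot of each pair reveals how different they are.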

Describing Variables: Building Blocks

Describing Variables

Types of Variables

The “What” according to Munzner (2025)

What we typically deal with…

  • Continuous: Price, returns, income
  • Count: Number of mergers, trades
  • Ordinal: Credit ratings (AAA, AA, A…)
  • Categorical: Industry, country
  • Binary: Default/no default

Describing Variables

Understanding Distributions

Common Distribution Shapes

What is a distribution?

  • How often a variable takes each of its possible values
  • Key question: How is the data spread out?
  • Examples:
    • Are most returns clustered around zero?
    • Do some firms have extremely high market caps?
    • Is income evenly distributed or skewed?

Describing Variables

Visualizing Distributions - Continuous Variables

Histograms

The Classic Choice

  • Divides data into bins
  • Height → frequency in each bin
  • Design choices matter:
    • Bin width too narrow → noise
    • Bin width too wide → oversmoothing
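
A small sketch of the binning step, to make the effect of bin width concrete. The return values (in percent) are hypothetical, and `histogram_counts` is an illustrative helper rather than a library function. Bins are half-open intervals of the form [k × width, (k + 1) × width).

```python
import math
from collections import Counter


def histogram_counts(values, bin_width):
    """Map each bin's left edge to the number of observations in that bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}


# Hypothetical daily returns in percent
returns = [-3.2, -1.1, -0.4, -0.2, 0.1, 0.3, 0.6, 1.4, 2.8]

narrow = histogram_counts(returns, 1)  # many bins, few observations each
wide = histogram_counts(returns, 5)    # two bins, all detail lost

print("width 1:", narrow)
print("width 5:", wide)
```

With a width of 1 the counts jump around from bin to bin, while a width of 5 collapses everything into "negative" and "non-negative". A sensible width lies somewhere in between and is worth trying out on the actual data.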

Describing Variables

Visualizing Distributions - Density Plots

The Smooth Alternative

  • Continuous curve estimate
  • No arbitrary bin choices
  • Easier to compare multiple distributions
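
A minimal sketch of what sits behind a density plot, assuming a Gaussian kernel density estimate (the slides do not name a specific estimator). The sample values, the bandwidth, and the function name `gaussian_kde` are hypothetical.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical returns in percent
sample = [-1.2, -0.3, 0.0, 0.4, 2.5]
density = gaussian_kde(sample, bandwidth=0.5)

# Evaluate the curve on a grid, as a plotting routine would
grid = [i / 100 for i in range(-2000, 2001)]
curve = [density(x) for x in grid]
area = sum(curve) * 0.01  # Riemann sum, should be close to 1
print(f"area under the curve: {area:.4f}")
```

Note that the bandwidth plays a role similar to the bin width of a histogram: a small bandwidth gives a wiggly curve, a large one smooths away structure. Plotting libraries usually pick a default bandwidth automatically, but it remains a choice.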

Describing Variables

The Numbers Behind the Picture

Summary Statistics

Location (central tendency):
  • Mean: Average value
  • Median: 50th percentile
  • Mode: Most frequent value
Spread (variability):
  • Standard deviation: Typical distance from the mean (square root of the variance)
  • Variance: Average squared deviation from the mean (SD squared)
  • Range: Max - Min
  • Interquartile range (IQR): Q3 - Q1
Percentile

→ value below which X% of observations fall

Examples:

  • 25th percentile (Q1): Cutoff for the bottom quarter
  • 50th percentile (Median): Middle
  • 75th percentile (Q3): Cutoff for the top quarter
  • 1st/99th percentile: Extreme values (outliers?)
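
These measures can be computed with Python's standard library. The sketch below uses the x1 values of Anscombe's quartet (the standard published data) and assumes the "inclusive" quantile method, which interpolates linearly between observations. Other percentile conventions exist and give slightly different quartiles.

```python
from statistics import mean, median, quantiles, stdev, variance

# x1 from Anscombe's quartet
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# Quartiles with linear interpolation between observations
q1, q2, q3 = quantiles(x1, n=4, method="inclusive")

summary = {
    "N": len(x1),
    "Mean": mean(x1),
    "Median": median(x1),
    "SD": stdev(x1),          # sample standard deviation (divides by N - 1)
    "Variance": variance(x1),  # sample variance
    "Range": max(x1) - min(x1),
    "P25": q1,
    "P75": q3,
    "IQR": q3 - q1,
}

for name, value in summary.items():
    print(f"{name:<8} {value:>8.2f}")
```

The results match the x1 row of the Anscombe table shown earlier in this lecture (mean 9.0, SD 3.32, P25 6.50, P75 11.50).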
Putting it all together (Box Plot)

Describing Variables

Describing Relationships Between Variables

Moving Beyond Univariate Analysis

Key questions
  • How do two variables move together?
  • Is there a pattern?
  • Is the relationship linear or nonlinear?
Tools
  • Scatter plots
  • Correlation
  • Time series plots (for temporal relationships)

Describing Variables

Describing Relationships Between Variables

The Scatter Plot

Two Continuous Variables

What to look for

  • Direction (positive/negative)
  • Strength (tight cluster vs. diffuse)
  • Linearity (straight line or curve?)
  • Outliers
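
Direction and strength are what the Pearson correlation coefficient summarizes, but it only captures linear relationships. The sketch below uses two small hypothetical datasets to show this; `pearson_r` is an illustrative helper written from the textbook formula.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]  # perfect positive linear relationship
y_curve = [v ** 2 for v in x]      # perfect, but U-shaped, relationship

r_linear = pearson_r(x, y_linear)
r_curve = pearson_r(x, y_curve)
print(f"linear:   r = {r_linear:.2f}")
print(f"U-shaped: r = {r_curve:.2f}")
```

The U-shaped case has a correlation of zero even though y is fully determined by x, which is why the linearity check in the scatter plot comes before interpreting a correlation.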

Describing Variables

Describing Relationships Between Variables

Conditional Relationships

Term | Meaning | Example
Unconditional (or marginal) distribution | The overall spread of a variable, ignoring all other variables. | Height distribution of all children.
Conditional distribution | The spread of one variable given a specific value of another variable. | Distributions of vitamin E intake among those who exercise vigorously versus those who do not.
Unconditional mean | The average value of a variable, ignoring all other variables. | Average height of all children.
Conditional mean | The average value of a variable given a specific value of another variable. | Average vitamin E intake among those who exercise vigorously versus those who do not.
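
A short sketch of the difference between an unconditional and a conditional mean, following the vitamin E example. All numbers are made up for illustration, and `conditional_means` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def conditional_means(outcomes, groups):
    """Average of the outcome within each value of the grouping variable."""
    by_group = defaultdict(list)
    for y, g in zip(outcomes, groups):
        by_group[g].append(y)
    return {g: mean(values) for g, values in by_group.items()}


# Hypothetical data: vitamin E intake and whether the person exercises vigorously
intake = [8.0, 10.0, 12.0, 6.0, 7.0, 5.0]
exercises = [True, True, True, False, False, False]

unconditional = mean(intake)
conditional = conditional_means(intake, exercises)

print("Unconditional mean:", unconditional)
print("Mean given exercise:", conditional[True])
print("Mean given no exercise:", conditional[False])
```

The unconditional mean ignores the grouping variable entirely; the conditional means show how the average shifts once we know which group an observation belongs to.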

Describing Variables

Describing Relationships Between Variables

Correlation vs. Causation

The Most Important Distinction in Statistics

  • Correlation: X and Y move together
  • Causation: X causes Y

Describing Variables

Describing Relationships Between Variables

Time Series Plots

Daily Study Hours Over Time

Basics of Time Series Plots

  • X-axis: Time
  • Y-axis: Variable of interest
  • Connect the dots in chronological order

Essential for …

  • Event studies (Lecture 5)
  • (Parallel) Trend analysis (Lecture 4)
  • Structural breaks
  • Seasonality

Describing Variables

Describing Relationships Between Variables

Multiple Time Series - Comparison

VW vs. Competitors - TEASER A2

What this reveals
  • Contagion effects
  • Relative performance
  • Industry-wide vs. firm-specific shocks
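
When price levels differ across firms, one common way to make several series comparable is to rebase each to 100 at the first date. This is an assumption about how such a comparison might be prepared, not a description of the plot in the lecture. The firm names and prices below are hypothetical.

```python
def rebase_to_100(prices):
    """Express each price relative to the first observation (first = 100)."""
    base = prices[0]
    if base == 0:
        raise ValueError("cannot rebase a series that starts at zero")
    return [100 * p / base for p in prices]


# Hypothetical closing prices on the same four dates
prices = {
    "Firm A": [50, 55, 45, 40],
    "Peer B": [200, 210, 205, 202],
}

indexed = {firm: rebase_to_100(series) for firm, series in prices.items()}
for firm, series in indexed.items():
    print(firm, [round(v, 1) for v in series])
```

After rebasing, both lines start at 100, so a firm-specific drop (Firm A) stands out against a peer that barely moves (Peer B).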

Design Principles That Matter

Design Principles That Matter

Munzner (2025)’s Visualization Framework (Simplified)

Munzner (2025)

Three Important Questions

What? - What data do I have?

  • Tables, networks, time series, spatial data
  • We mostly work with: Tables (panel data)

Why? - Why am I visualizing?

  • Communication, exploration, verification

How? - How should I encode the data?

  • Position, color, size, shape

Design Principles That Matter

The Data-Ink Ratio

Edward Tufte’s Core Principle

\[ \text{Data-ink ratio} = \frac{\text{ink used to display data}}{\text{total ink used}}\]

Goal: Maximize this ratio
  • Remove chart junk
  • Eliminate non-data ink
  • Let the data speak
Example

Design Principles That Matter

Choosing the Right Plot Type

Match Visualization to Data Structure

Data Structure | Best Visualization
One continuous variable | Histogram, density plot, box plot
One categorical variable | Bar chart
Two continuous variables | Scatter plot
Continuous over time | Line plot
Multiple groups over time | Line plot or small multiples
Part-to-whole | Stacked bar (avoid pie charts!)

Design Principles That Matter

Color Theory Basics

Strategic Use of Color

Three palette types:
  1. Sequential: Light to dark (e.g., revenue growth)
  2. Diverging: Two-color scale with neutral middle (e.g., positive/negative returns)
  3. Categorical: Distinct colors for groups (e.g., different firms)

Rule

Use color to enhance, not decorate

Design Principles That Matter

Accessibility - Colorblind Considerations

~8% of men have color vision deficiency

  • Avoid red-green combinations
  • Use colorblind-safe palettes:
    • Viridis (sequential)
    • ColorBrewer (all types)

Design Principles That Matter

Axis Integrity

Y-axis should start at zero for bar charts

Exceptions

  • Line plots (time series can start at sensible minimum)
  • When showing small changes in large numbers
  • BUT: Always be transparent about scale choices

Best practice

  • Use non-zero axis when changes matter
  • BUT: Clearly label and acknowledge the choice
  • Never hide the axis range
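
A worked sketch of why the zero baseline matters for bar charts: the eye compares bar heights, and bar height is the value minus the axis start. The numbers and the function name `visual_ratio` are hypothetical.

```python
def visual_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of the two bar heights as drawn, given where the y-axis starts."""
    height_a = value_a - axis_start
    height_b = value_b - axis_start
    if height_a <= 0 or height_b <= 0:
        raise ValueError("both values must lie above the axis start")
    return height_b / height_a


# Hypothetical revenues: 100 last year, 105 this year (a 5% increase)
honest = visual_ratio(100, 105, axis_start=0)
truncated = visual_ratio(100, 105, axis_start=95)

print(f"Axis from 0:  second bar is {honest:.2f}x the first")
print(f"Axis from 95: second bar is {truncated:.2f}x the first")
```

With the axis starting at 95, a 5% increase is drawn as a bar twice as tall. For line plots the reader compares slopes and positions rather than heights, which is why a non-zero axis is more defensible there, as long as the scale is clearly labeled.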

Design Principles That Matter

Typography Matters

\[ \text{Readable} > \text{Beautiful} \]

Font guidelines

  • Minimum size: 8pt for print, 12pt for presentations
  • Font families:
    • Sans-serif for plots (Arial, Helvetica)
    • Serif for text if desired (Times, Garamond)
  • Consistency: Use same fonts throughout paper/presentation

Labels

  • Axis labels: Clear, concise
  • Title: Informative, standalone
  • Legend: Only if necessary (prefer direct labels)

Design Principles That Matter

The Iteration Process

Tables

Should be self-explanatory! Facts!

Tables

The Unsung Hero of Academic Communication

Tables are better when…

  • Exact values matter (not just patterns)
  • Many variables to compare
  • Formal hypothesis testing results
  • Standard format expected (e.g., regression output)

Plots are better when…

  • Pattern matters more than precision
  • Large amounts of data
  • Relationships are key

Tables

The Unsung Hero of Academic Communication

Summary Statistics Table - Anatomy

Standard Format (in my opinion)
Variable | N | Mean | SD | Min | P25 | Median | P75 | Max
What this shows
  • N: Sample size (are there missing values?)
  • Mean: Central tendency
  • SD: Dispersion
  • Min/Max: Range (outliers?)
  • Percentiles: Distribution shape
What to include
  • Dependent variable(s)
  • Key variables of interest and all other variables
  • If panel data: Note firm-year structure
  • Notes on data source, transformations, etc.
  • Adhere to journal/MSc thesis style guide

Tables

Table Design Principles

Making Tables Readable

Alignment
  • Numbers: Right-align
  • Text: Left-align
  • Headers: Center acceptable
Precision
  • 2-3 decimal places usually sufficient
  • Be consistent across columns
  • Consistent between coefficients and fit stats
Formatting
  • Use rules (horizontal lines) sparingly
  • White space is your friend
  • Bold for emphasis (e.g., column headers)
Notes
  • Define variables
  • Explain sample restrictions
  • Note data sources
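
The alignment and precision rules can be applied mechanically. The sketch below formats a plain-text summary statistics table with left-aligned variable names, right-aligned numbers, a fixed number of decimals, and only three horizontal rules. The function `format_summary_table` is an illustrative helper; the two rows reuse values from the Anscombe table earlier in this lecture.

```python
HEADERS = ["Variable", "N", "Mean", "SD", "Min", "P25", "Median", "P75", "Max"]


def format_summary_table(rows, decimals=2):
    """Render rows of (name, n, mean, sd, min, p25, median, p75, max) as text."""
    body = []
    for name, n, *stats in rows:
        # Same number of decimals in every statistic column
        body.append([name, str(n)] + [f"{s:.{decimals}f}" for s in stats])

    widths = [
        max(len(line[i]) for line in [HEADERS] + body)
        for i in range(len(HEADERS))
    ]

    def fmt(cells):
        first = cells[0].ljust(widths[0])  # text: left-align
        rest = [c.rjust(w) for c, w in zip(cells[1:], widths[1:])]  # numbers: right-align
        return "  ".join([first] + rest)

    rule = "-" * len(fmt(HEADERS))
    return "\n".join([rule, fmt(HEADERS), rule] + [fmt(r) for r in body] + [rule])


rows = [
    ("x1", 11, 9.0, 3.32, 4.00, 6.50, 9.00, 11.50, 14.00),
    ("y1", 11, 7.5, 2.03, 4.26, 6.31, 7.58, 8.57, 10.84),
]
table = format_summary_table(rows)
print(table)
```

For a paper or thesis, the same structure would typically be exported to LaTeX or Word, with table notes that define the variables, the sample, and the data source.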

Tables

Table Design Principles

Example & Teaser for Assignment 6

Huck (2024), Table 2

Takeaways

Takeaways

What to Remember

  • Visualization is powerful - for communication AND deception
  • Describe first, model later - understand distributions and relationships
  • Design principles matter - data-ink ratio, color, accessibility
  • Tables are essential - especially in academic work
  • Iteration improves quality - first draft is never the last draft

The big picture

Good visualization and tables are core skills, not optional extras.

Teaser

Next Week Preview

Regression Methods and Identification

Building on today

  • We can now DESCRIBE data
  • Next: We learn to MODEL relationships
  • OLS assumptions
  • What is identification?
  • Endogeneity and causation

The connection

  • Today: “What does the data show?”
  • Next week: “What does it mean?”

Thank You for Your Attention!

See You in the Next One!

References

Beattie, Vivien, and Michael John Jones. 1992. “The Use and Abuse of Graphs in Annual Reports: Theoretical Framework and Empirical Study.” Accounting and Business Research 22 (88): 291–303.
Huang, Shaio-Yan, Shi-Ming Huang, Tung-Hsien Wu, and Tung-Yen Hsieh. 2011. “The Data Quality Evaluation of Graph Information.” Journal of Computer Information Systems 51 (4): 81–91.
Huck, John R. 2024. “The Psychological Externalities of Investing: Evidence from Stock Returns and Crime.” The Review of Financial Studies 37 (7): 2273–2314.
Huntington-Klein, Nick. 2022. The Effect: An Introduction to Research Design and Causality. 2nd ed. Chapman and Hall/CRC.
Munzner, Tamara. 2025. “Visualization Analysis and Design.” In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Courses, 1–2.

Appendix

Appendix

Types of Variables

Type | What it represents | Key characteristics | Typical examples
Continuous | Real‑valued measurements that can take any value within a range (often infinite). | No natural “next” value; can be split to arbitrary precision. | Monthly income, height, temperature.
Count | Non‑negative integers recording counts of occurrences. | Cannot be negative; usually discrete, but can sometimes be treated as continuous when counts are large. | Number of mergers in a year, number of accidents.
Ordinal | Categories with a meaningful order but unknown spacing. | “Higher” vs “lower” is defined but not how much higher. | Neuroticism level (low / medium / high), education level.
Categorical (nominal) | Distinct, unordered categories. | No inherent ordering. | Flower colour, blood type, eye colour.
Binary | Special case of categorical with exactly two possible values (often “yes”/“no”). | Simplifies analysis; can be expanded into several binary variables. | Military service (yes/no), treatment assignment.
Qualitative | Text, images, or other non‑numeric information that cannot be straightforwardly categorised. | Requires transformation or summarisation before quantitative analysis. | Newspaper headline, interview transcript.

Appendix

Plot types overview

Proportions

Single Distribution

Appendix

Plot types overview

Multiple Distributions

Appendix

Plot types overview

Proportions - 2 + Groups